---
name: gtars
description: A high-performance Rust toolkit (with Python bindings and a CLI) for genomic interval analysis; use it when you need fast overlap queries, coverage track generation, genomic tokenization for ML, reference sequence verification, or fragment processing.
license: MIT
author: aipoch
---
> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)

## When to Use

- **Overlap and set operations on genomic intervals** (e.g., peak/promoter overlap, variant annotation, shared-feature detection).
- **Coverage track generation** from interval-like inputs (e.g., ATAC-seq/ChIP-seq/RNA-seq coverage for visualization in genome browsers).
- **Machine-learning preprocessing** where genomic regions must be converted into discrete **tokens** (e.g., Transformer-style models, geniml-style pipelines).
- **Reference sequence management and verification** (e.g., subsequence retrieval, digest calculation aligned with GA4GH refget concepts).
- **Single-cell fragment workflows** (e.g., splitting fragments by barcode/cluster, scoring fragments against reference region sets).

## Key Features

- **Rust performance** with low overhead; designed for large genomic datasets.
- **Python bindings** for integration into analysis notebooks/pipelines.
- **CLI tooling** for batch processing and shell workflows.
- **Fast overlap detection** via IGD-style indexing and interval operations.
- **Coverage track generation** (WIG/BigWig workflows via the `uniwig` functionality).
- **Genomic tokenizers** for ML-ready representations of genomic regions.
- **Reference sequence utilities** (FASTA-backed stores, subsequence retrieval, digesting).
- **Fragment processing and scoring** for common single-cell genomics tasks.

> Additional module-specific guidance may be available in: `references/overlap.md`, `references/coverage.md`, `references/tokenizers.md`, `references/refget.md`, `references/python-api.md`, and `references/cli.md`.

## Dependencies

- **Python package**: `gtars` (version not specified in the source document)
- **Rust toolchain** (for CLI install): `cargo` (version not specified)
- **Rust crate**: `gtars = "0.1"` (as shown in the example)

## Example Usage

### Python: overlap analysis workflow (runnable)

```python
import gtars

# Load two region sets
peaks = gtars.RegionSet.from_bed("chip_peaks.bed")
promoters = gtars.RegionSet.from_bed("promoters.bed")

# Find overlaps (peaks that overlap promoters)
overlapping_peaks = peaks.filter_overlapping(promoters)

# Export results
overlapping_peaks.to_bed("peaks_in_promoters.bed")
```
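
To make the meaning of "peaks that overlap promoters" concrete without depending on the library, here is a minimal pure-Python sketch of overlap filtering. It does not use gtars or its IGD-style index. `Interval` and `filter_overlapping` are hypothetical names introduced only for this illustration, and coordinates are assumed to be 0-based and half-open (the BED convention).

```python
from bisect import bisect_left
from collections import defaultdict
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention, assumed)
    end: int    # exclusive


def filter_overlapping(queries, targets):
    """Return the queries that overlap at least one target interval."""
    # Group target intervals by chromosome.
    by_chrom = defaultdict(list)
    for t in targets:
        by_chrom[t.chrom].append((t.start, t.end))

    # Per chromosome: sorted starts plus a running maximum of ends.
    index = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        starts = [s for s, _ in ivs]
        prefix_max_end = []
        running = 0
        for _, e in ivs:
            running = max(running, e)
            prefix_max_end.append(running)
        index[chrom] = (starts, prefix_max_end)

    kept = []
    for q in queries:
        if q.chrom not in index:
            continue
        starts, prefix_max_end = index[q.chrom]
        # Count targets that start strictly before the query ends.
        n = bisect_left(starts, q.end)
        # The query overlaps if any of those targets ends after it starts.
        if n > 0 and prefix_max_end[n - 1] > q.start:
            kept.append(q)
    return kept


peaks = [
    Interval("chr1", 100, 200),
    Interval("chr1", 500, 600),
    Interval("chr2", 10, 20),
]
promoters = [Interval("chr1", 150, 300), Interval("chr1", 600, 700)]
print(filter_overlapping(peaks, promoters))
```

Only the first peak is kept: the second merely touches a promoter at position 600, which is not an overlap under half-open coordinates, and the third lies on a chromosome with no promoters.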

### CLI: generate coverage tracks (runnable)

```bash
# Generate WIG coverage at a given resolution
gtars uniwig generate --input atac_fragments.bed --output coverage.wig --resolution 10

# Generate BigWig coverage for genome browser visualization
gtars uniwig generate --input atac_fragments.bed --output coverage.bw --format bigwig
```
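
As a conceptual aid, the sketch below shows one way coverage at a fixed resolution can be computed from fragments: build per-base depth with a difference array, then average it in bins. This is not the `uniwig` algorithm and makes no claim about how gtars computes or smooths coverage. `binned_coverage` is a hypothetical helper, and fragments are assumed to be 0-based, half-open `(start, end)` pairs on a single chromosome.

```python
from itertools import accumulate


def binned_coverage(fragments, chrom_length, resolution):
    """Mean per-base depth in consecutive bins of `resolution` bases."""
    # Difference array: +1 where a fragment starts, -1 just past its end.
    diff = [0] * (chrom_length + 1)
    for start, end in fragments:
        start = max(start, 0)
        end = min(end, chrom_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    # A prefix sum turns the difference array into per-base depth.
    depth = list(accumulate(diff))[:chrom_length]

    # Average the depth inside each bin; the last bin may be shorter.
    bins = []
    for bin_start in range(0, chrom_length, resolution):
        window = depth[bin_start:bin_start + resolution]
        bins.append((bin_start, sum(window) / len(window)))
    return bins


fragments = [(0, 10), (5, 15)]
print(binned_coverage(fragments, chrom_length=20, resolution=5))
```

A dense per-base array is fine for a toy chromosome but would be memory-hungry at genome scale, so treat this purely as an illustration of what "resolution" means for a coverage track.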

### Python: ML tokenization (runnable)

```python
import gtars
from gtars.tokenizers import TreeTokenizer

# Load regions and build a tokenizer from BED
regions = gtars.RegionSet.from_bed("training_peaks.bed")
tokenizer = TreeTokenizer.from_bed_file("training_peaks.bed")

# Tokenize each region into a discrete representation
tokens = [tokenizer.tokenize(r.chromosome, r.start, r.end) for r in regions]

print(tokens[:5])
```
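
The idea of mapping regions to tokens can be illustrated without the library. The following is a toy sketch under stated assumptions: each universe region gets an integer ID, and a query region is tokenized as the IDs of the universe regions it overlaps. `UniverseTokenizer`, its `UNK` value, and the non-overlapping-universe assumption are all hypothetical choices for this example, not a description of how `TreeTokenizer` works internally.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class UniverseTokenizer:
    """Toy tokenizer: a token ID is the index of a universe region."""

    UNK = -1  # hypothetical ID for queries that hit nothing in the universe

    def __init__(self, universe):
        # `universe`: list of (chrom, start, end), half-open, assumed
        # non-overlapping within each chromosome.
        self.vocab = {}
        by_chrom = defaultdict(list)
        for token_id, (chrom, start, end) in enumerate(sorted(universe)):
            self.vocab[(chrom, start, end)] = token_id
            by_chrom[chrom].append((start, end, token_id))

        # Per chromosome: parallel sorted lists for binary search.
        self._index = {}
        for chrom, rows in by_chrom.items():
            starts = [r[0] for r in rows]
            ends = [r[1] for r in rows]
            ids = [r[2] for r in rows]
            self._index[chrom] = (starts, ends, ids)

    def tokenize(self, chrom, start, end):
        """Return IDs of universe regions overlapping [start, end)."""
        if chrom not in self._index:
            return [self.UNK]
        starts, ends, ids = self._index[chrom]
        lo = bisect_right(ends, start)  # first region ending after `start`
        hi = bisect_left(starts, end)   # first region starting at or after `end`
        return ids[lo:hi] or [self.UNK]


universe = [("chr1", 0, 100), ("chr1", 200, 300), ("chr2", 50, 150)]
toy_tokenizer = UniverseTokenizer(universe)
print(toy_tokenizer.tokenize("chr1", 50, 250))
print(toy_tokenizer.tokenize("chr1", 100, 200))
```

The first query spans both `chr1` universe regions and yields two tokens; the second falls entirely in the gap between them and yields the unknown token.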

## Implementation Details

- **Interval overlap & indexing**: Overlap queries use an IGD-style index that is built once and then queried many times, which speeds up repeated interval lookups. A query is specified by chromosome, start, and end; the result is the set of overlapping intervals, or the output of a set operation derived from them.
- **Coverage generation (`uniwig`)**: Produces coverage tracks from interval or fragment input. Common options include the output format (e.g., WIG vs BigWig) and the resolution (bin size), which controls track granularity.
- **Tokenization**: Tokenizers (e.g., `TreeTokenizer`) map genomic coordinates to discrete tokens suitable for ML pipelines. The token vocabulary is commonly derived from a universe of training regions defined in a BED file.
- **Reference sequence store**: FASTA-backed reference access supports subsequence retrieval and digesting/verification workflows aligned with refget-style usage.
- **Fragment workflows**: Fragment splitting and scoring operate on fragment-like inputs (often TSV/BED-style) and can be used for barcode/cluster partitioning and enrichment-style scoring against reference region sets.
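
The digesting step mentioned under the reference sequence store can be sketched with the standard library alone. The GA4GH `sha512t24u` digest is the SHA-512 hash truncated to its first 24 bytes and base64url-encoded. The function names below are hypothetical, the `SQ.` prefix follows the GA4GH identifier style, and uppercasing the sequence before hashing is an assumption of this sketch. It does not call gtars.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """GA4GH truncated digest: SHA-512, first 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(data).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def sequence_digest(sequence: str) -> str:
    """Digest a sequence string (uppercasing first is an assumption here)."""
    normalized = sequence.upper().encode("ascii")
    return "SQ." + sha512t24u(normalized)


def verify(sequence: str, expected: str) -> bool:
    """Return True if the sequence hashes to the expected identifier."""
    return sequence_digest(sequence) == expected


reference_id = sequence_digest("ACGT")
print(reference_id)
print(verify("acgt", reference_id))
```

Because 24 bytes encode to exactly 32 base64 characters, the digest never carries `=` padding, which keeps the identifier URL-safe.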